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ABSTRACT 


Some  of  the  requirements  of  a  computer  system  for  searching 
chemical  compound  structure  analogs  are  reviewed  and  algorithms 
are  offered  where  appropriate.  Among  the  factors  discussed  are 
file  organization  and  data  elements,  the  use  of  screens,  the  question 
of  maximum  query  volume  on  a  single  pass  of  a  master  file,  and  the 
question  of  canonical  forms.  Of  particular  not  are  the  experimental 
algorithms  developed  for  the  canonical  ordering  of  finite  undirected 
graphs  permitting  rapid  determination  of  isomorphism  between  graphs 
to  be  accomplished. 
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s«c»imi.  introduction 


In  any  comprehensive  computerized  system  for  the  performance 
of  whole  compound  and  fragment  searches  of  a  large  file  of  chemical 
compound  structure  analogs,  there  are  several  major  factors  which 
must  be  carefully  evaluated.  Among  these  are: 

1.  The  basic  file  organization  and  data  elements. 

2.  The  number,  kinds,  and  fineness  of  screens  employed. 

3.  The  number  of  fragment  queries  which  may  be  processed 
on  one  pass  of  the  master  file. 

4.  The  desirability  and  feasibility  of  canonical  forms  for 
each  compound  structure  analog. 

A  so-called  "search  algorithm"  cannot  effectively  be  developed  without 
taking  these  factors  into  account.  Because  of  this,  a  search  algorithm 
will  of  necessity  be  based  on  the  resolution  of  these  questions  and  will 
itself  embody  many  algorithms.  Some  of  these  would  be  concerned 
with  the  achievement  of  canonical  forms,  others  with  the  logic  of  frag¬ 
ments  searches,  and  still  others  with  the  application  of  screens.  Each 
of  the  above  factors  will  be  discussed  and  algorithms  offered  where 
appropriate.  In  this  manner,  the  reader  may  choose  or  discard 
algorithms  as  he  sees  fit  or  recombine  them  to  suit  the  peculiar  de¬ 
mands  of  a  specific  requirement. 


SMtion  II.  FILE  ORGANIZATION  AND  DATA  ELEMENTS 


Our  system  is  postulated  on  the  input  availability  of  the  following 
minimum  elements  of  information  for  each  compound: 

1.  BATCH  Number* 

2.  Molecular  Formula  (optional,  if  not  present  it  will  be 
generated  from  the  connectivity  table.  If  present,  it  will 
be  used  as  a  redundancy  check). 

3.  Arbitrarily  numbered  connectivity  table  containing  the 
following  elements  of  information; 

Atom  Atom  Connectivity  and  Bond 

Number  Qualification  Qualification 


01. 

XX 

NNBB, 

NNBB, 

NNBB, 

02. 

XX 

NNBB, 

NNBB 

03. 

XX 

NNBB, 

NNBB, 

NNBB. 

(Where  NN  is  the  Number 
of  the  connected-to  atom 
and  BB  is  the  value  of  the 
Bond) 

From  the  above  information,  the  following  information  will  be 
generated: 

1 .  Bond  Summary* 

2.  Penny  Connectivity  Code  for  each  atom^ 

3.  The  Level  Coincidence  Number  if  required  (See  Section 
"The  Question  of  Canonical  Form"). 

Each  compound  will  be  canonically  ordered  at  input  time  and  will 
be  maintained  in  that  form  thereafter.  The  general  logic  for  this 
operation  is  shown  in  Figure  1;  the  total  Master  Record  Data  elements 
are  shown  in  Figure  2. 

One  of  the  basic  decisions  that  must  be  made  for  any  large-scale 
serial  file  is  the  basic  order  or  orders  of  that  file,  and  this  decision 


♦  Bond  Summary  is  a  tabulation  of  sums  of  each  bond  type;  the 
bonds  in  the  Bond  Summary  bear  the  same  relation  to  Bond  Summary 
as  the  atoms  in  a  compound  bear  to  the  molecular  formula. 
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IF' 

S' 


* 


ACCOMPLISH  FURTHER  ORDERING  BY  RANKING 
TIED  ATOMS  BY  THEIR  CONNECTEO-TO  SETS. 
MEMBER  ATOMS  OF  THE  CONNECTEO-TO  SETS 
WILL  BE  ORDERED  WITHIN  EACH  SET  BY  THE 
SAME  ATOM  ORDER  AS  SPECIFIED  ABOVE. 


NO 


Figure  1 .  General  Logic  Diagram  for  Canonically  Ordering  Structural 
Analogs  of  Chemical  Compounds 
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is  subject  to  many  considerations.  ’  Although  many  large  printed 
compendia  of  chemical  compound  data  are  ordered  by  molecular 
formula  or  certain  gross  structural  characteristics,  magnetic  tape 
files  should  not  necessarily  follow  such  an  order.  Wiswesser  has 
shown  that  the  distinctions  afforded  by  molecular  formula  do  not 
sufficiently  distinguish  the  large  majority  of  chemical  compounds.^ 

The  need  to  search  a  file  and  to  avoid  as  much  as  possible  the  detailed 
examination  of  each  compound  suggests  a  careful  choice  in  the  defini¬ 
tive  index  which  must  be  appended  to  each  compound.  Such  an  index 
is  found  in  the  Wiswesser  BATCH  number.  This  does  not,  however, 
preclude  the  need  for  the  molecular  formula.  In  light  of  these  and 
other  considerations,  we  have  constructed  a  control  word  for  each 
compound  (in  the  order  presented)  consisting  of  the  BATCH  Number, 
Molecular  Formula,  and  Bond  Summary. 

The  Master  Compound  File  will  be  in  major  sequence  by  BATCH 
Number  and  within  that  by  Molecular  Formula  and  Bond  Summary. 

The  general  logic  for  file  maintenance  and  v/hole  compound  searching 
is  outlined  in  Figure  3.  For  fragment  searches,  the  general  logic  is 
outlined  in  Figure  4. 


Figure  3.  Algorithm  for  File  Maintenance  and  Whole  Compound  Searching 


Figure  4.  Algorithm  for  Serial  Processing  of  Fragment  Queries 


Section  III.  SCREENS 


The  basic  purpose  of  a  screen  is  to  preclude  the  detailed  exanaina- 
tion  of  each  compound  during  a  search.  As  such,  the  utilization  of 
screens  in  a  serial  application  is  intimately  bound  up  with  the  question 
of  file  sequence.  In  addition,  the  use  of  multiple  screens  offers  a 
powerful  tool  in  avoiding  detailed  searching  when  possible.  A  careful 
balance  must  be  established  empirically  between  the  added  time  ex¬ 
pended  in  screening  compounds  versus  the  time  saved  by  avoiding  * 

detailed  search.  It  is  felt  that  considerable  experimentation  must  be 
accomplished  to  determine  the  proper  usage  of  screens.  » 

Because  of  the  proposed  organization  of  our  files  using  a  single 
control  word  consisting  of  BATCH  Number,  Molecular  Formula,  and 
Bond  Summary,  a  detailed  examination  of  a  given  compound  is  required 
during  whole  compound  searching  only  when  it  has  the  same  control 
word  as  the  query  compound.  When  this  is  the  case,  a  more  detailed 
examination  will  be  made  utilizing  the  Penny  connectivity  codes.  The 
complete  logic  for  whole  compound  matching  is  shown  in  Figure  3. 

A  different  approach  is  required  for  fragment  searching.  The 
fragment  control  word  must  be  in  all  cases  either  identical  to  or 
imbedded  in  the  master  file  control  word  before  a  detailed  examination 
is  required.  The  logic  for  fragment  searching  is  shown  in  Figure  4. 


8 


SMtion  IV.  QUERY  LIMITS 


It  is  proposed  that  file  maintenance  and  unlimited  whole  compound 
searching  be  accomplished  simultaneously  as  shown  in  Figure  3.  A 
separate  operation  is  proposed  for  fragment  searching,  although  there 
is  no  problem  in  incorporating  whole  compound  searching  with  frag¬ 
ment  searching  if  such  is  desired, 

•  The  question  of  how  many  fragment  queries  may  be  accomplished 

on  a  given  pass  of  the  master  file  is  a  serious  one.  In  the  terminology 

^  of  information  retrieval,  the  question  (for  a  serial  operation)  is  the 

choice  of  a,  linear  file  versus  an  inverted  file.  Although  the  use  of 
inverted  files  places  no  limit  on  the  number  of  queries  that  may  be 
processed  on  one  pass  of  the  master  file,  the  inversion  of  each  chem¬ 
ical  structure  in  terms  of  all  its  possible  fragments  reaches  impracti¬ 
cal  proportions.  On  the  other  hand,  the  use  of  a  linear  file  limits  the 
number  of  queries  as  a  function  of  t  size  of  its  fast-accessible  store 
(memory,  drum,  etc.  ).  If  the  number  of  queries  exceeds  the  limit  of 
the  fast-accessible  store,  the  queries  must  be  placed  on  tape  and  the 
tape  "seesawed"  back  and  forth  for  each  compound  in  the  master  file. 
At  the  present  time,  it  is  felt  that  a  linear  search  should  be  accom¬ 
plished  with  every  effort  expended  to  accommodate  the  largest  number 
of  queries.  It  was  on  this  basis  that  the  algorithms  in  Figures  3  and 
4  were  developed. 
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Section  V.  THE  QUESTION  OF  CANONICAL  FORMS 


In  many  computerized,  atom-by-atom,  bond-by-bond  searches 
(such  as  those  employed  by  DuPont  and  the  Chemical  Abstracts  Service), 
efforts  have  been  directed  toward  the  achievement  of  canonical  forms 
for  chemical  compound  structure  analogs  such  that  for  each  unique 
compound  there  is  one  and  only  one  canonical  form.  This  is  usually 
accomplished  by  subjecting  arbitrarily  numbered  structure  analogs 
to  a  rigorous  ordering  procedure  utilizing  atom  and  bond  qualifica¬ 
tions  and  connectivity  data  as  ordering  criteria.  The  result  is  a  unique 
list  or  matrix  for  each  compound  which  is  tantamount  to  a  formal 
renumbering  of  the  atoms  of  the  subject  compound. 

The  use  of  a  canonical  order  greatly  facilitates  the  process  of 
whole  compound  matching.  The  same  is  not  true  for  fragment  search¬ 
ing,  since  the  order  accorded  a  fragment  is  not  the  same  order 
accorded  that  fragment  when  it  is  a  member  of  a  large  set.  The  eval¬ 
uation  of  a  total  compound  is  accomplished  in  terms  of  all  the  member 
atoms,  bonds,  and  connectivities  of  the  subject  compound.  Thus 
"backtracking,"  which  is  the  nemesis  of  fragment  searching  techniques, 
is  required  in  many  cases. 

The  question  as  to  whether  or  not  one  may  always  achieve  a 
canonical  form  for  each  graph  analog  of  a  compound  appears  still  to 
be  undecided.  The  presentation  of  counterexamples,  such  as  the 
arbitrarily  numbered  planar  graph  in  Figure  5  devised  by  Dr.  Lehman 
of  the  Walter  Reed  Army  Institute  of  Research,  have  necessitated 
revisions  to  canonical  ordering  techniques  such  as  those  used  by 
DuPont  and  the  Chemical  Abstracts  Service. 


In  light  of  this  and  other  similar  examples,  it  appears  that  a  rigorous 
formal  treatment  is  needed  to  decide  whether  the  achievement  of 
canonical  forms  of  graph  analogs  of  chemical  compounds  is  always 
possible,  or  whether  the  question  is  truly  undecidable. 


10 


Often,  the  test  of  a  canonical  ordering  scheme  is  its  ability  to 
handle  simple  undirected  graphs  for  which  atom  and  bond  differentia¬ 
tion  is  not  available  for  ordering  criteria,  typified  by  the  Lehman  example. 
Such  a  test  often  unmasks  latent  fallacies  in  canonical  ordering  pro¬ 
cedures  which  are  dependent  on  the  existence  of  atom  and  bond  qualifi¬ 
cations.  The  fact  remains,  however,  that  chemical  compounds  do 
possess  such  qualifications,  and  whether  or  not  the  inability  of  chemi¬ 
cally  oriented  canonical  ordering  schemes  to  handle  unqualified  graphs 
is  a  real  shortcoming  remains  to  be  seen. 

Penny,  in  a  recent  paper^  ,  recognizes  correctly  that  atom  and 
bonding  considerations  alone  are  in  some  cases  inadequate  for  dis¬ 
tinguishing  compounds.  His  method  is  concerned  with  enumerating 
the  simple  connectivity  in  the  neighborhood  of  each  atom.  As  he 
states,  "it  is  a  unique  expression  of  the  atomic  network  within  the 
immediate  neighborhood  of  the  subject  atom  and  is  an  attribute  of  the 
atom  as  much  as  its  chemical  indentity". 

Returning  to  the  Lehman  graph  (Figure  5),  we  can,  by  inspection, 
ascertain  the  topological  equivalency  of  nodes  1,  3,  5,  and  8,  nodes 
2  and  7,  and  nodes  4  and  6,  which  poses  the  following  problems.  Can 
this  determination  be  accomplished  algorithmically  and,  once  done, 
can  one  rank  the  groups  with  respect  to  each  other  and,  finally,  can 
one  rank  the  nodes  within  an  equivalent  set?  If  a  general  solution 
(algorithm)  for  achieving  a  canonical  form  can  be  derived  for  finite 
undirected  graphs,  there  is  no  question  of  its  applicability  to  the  more 
highly  differentiated  graph  analogs  of  chemical  compound  structures. 

Such  a  solution  would  provide  the  needed  theoretical  foundation  for  the 
development  <£  canonical  ordering  routines  for  real  compound  struc¬ 
tures. 

The  Lehman  example  rendered  in  the  Penny  notation  yields  the 


following  table  of  connectivity  codes: 

1. 

/22/22/22/ 

2. 

/22/22/22/ 

3. 

/22/22/22/ 

4. 

/22/22/22/ 

5. 

722/22/22/ 

6. 

722/22/22/ 

7. 

/22/22/22/ 

8. 

722/22/22/ 

The  undifferentiated  aspect  of  this  table  is  more  a  peculiarity  of  the 
Lehman  graph  itself  than  a  shortcoming  of  the  notation.  For  this  and 
all  similar  examples  a  more  exhaustive  approach  must  be  undertaken. 
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Using  the  Penny  concept  (less  the  notation),  each  node  of  the 
Lehman  graph  is  defined  to  a  depth  of  three  levels  using  the  arbitrary 
node  numbers  shown  in  Figure  5.  The  resultant  lists  are  shown  dia- 
grammatically  in  Figure  6.  In  dealing  with  connectivity  tables  of 
arbitrarily  numbered  graphs,  it  must  be  kept  in  mind  that  no  signifi¬ 
cance  attaches  to  the  particular  number  assigned  any  node.  However, 
an  evaluation  of  each  node  in  terms  of  all  other  nodes  at  a  given  level 
can  be  made  to  yield  information,  and  this  information  can  ultimately 
be  used  to  canonically  number  graphs. 

As  can  be  seen  in  Figure  6,  each  node  of  the  Lehman  graph  is 
depirted  through  three  levels  of  connectivity.  Each  node  list  is 
analyzed  according  to  the  number  of  times  a  given  node  number  repeats  at 
each  level,  and  this  information  is  summarized  in  the  "level  coincidence 
number"  or  LCN.  For  node  1,  for  example,  at  the  second  level  there 
is  an  LCN  of  4100  which  is  interpreted  as  follows:  there  are  four 
unique  node  numbers,  one  duplicate,  and  no  t  iplicates  or  quadrupli¬ 
cates  appearing  at  this  level.  The  topological  equivalency  of  nodes 
1,  3,  5,  and  8  in  the  Lehman  example  is  reflected  in  the  equivalency 
of  their  respective  level  coincidence  numbers  at  the  second  and  third 
level,  which  are  4100  and  4201  in  all  four  cases.  Equivalent  nodes  4 
and  6  have  level  coincidence  numbers  of  2200  and  4120  at  the  second 
and  third  level,  and  equivalent  nodes  2  and  6  have  level  coincidence 
numbers  of  6000  and  6000  at  the  second  and  third  level. 

Further  experiments  have  demonstrated  the  utility  of  the  level 
coincidence  number  as  a  measure  of  the  degree  of  connectivity  or 
"imbeddedness"  of  the  nodes  of  a  graph.  As  a  result,  we  have  identi¬ 
fied  three  classes  of  graphs; 

Class  1  -  Differentiated,  i.  e.  ,  a  graph  in  which  each  node 

enjoys  a  unique  degree  of  imbeddedness. 

Class  2  -  Mixed,  i.  e.  ,  a  graph  in  which  some  nodes  enjoy 
the  same  degree  of  imbeddedness. 

Class  3  -  Undifferentiated,  i.  e.  ,  a  graph  in  which  every 
node  enjoys  identical  imbeddedness. 

Shown  below  in  Figure  7  are  three  graphs  which  demonstrate 
respectively  the  Class  1,  Class  2,  and  Class  3  distinctions.  It  should 
be  noted  that  graphs  B  and  C  obtain  different  classes  even  though  both 
are  eight-node  graphs  regular  of  degree  three. 
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Figure  7.  Examples  of  Basic  Graph  Types 

The  development  of  the  L.CN  does  not,  in  itself,  solve  the  problem 
of  canonically  ordering  the  nodes  of  a  graph  since  there  may  be  dupli¬ 
cate  LCN's  (Class  2)  or  the  graph  may  consist  entirely  of  nodes  with 
equivalent  LCN's  (Class  3).  For  these  cases,  the  experimental 
algorithms  in  Figures  8  and  9  have  been  devised.  The  algorithm  in 
Figure  9  is  concerned  solely  with  canonically  ordering  Class  3  graphs 
and  resolving  ties  in  Class  2  graphs.  The  algorithm  in  Figure  8, 
using  the  alg  rithm  in  Figure  9  as  a  subroutine,  is  addressed  to  the 
problem  of  canonically  ordering  any  finite  undirected  graph  without 
loops  and  multiplicity  edges. 

Shown  below  in  Figure  10  are  the  planar  representations  of  two 
Class  3  graphs  representing  the  nodes  of  a  cube  and  a  dodecahedron, 
respectively.  Both  of  these  have  been  canonically  numbered  using  the 
algorithm  in  Figure  8.  These  and  other  Class  3  examples  have  been 
tested  by  furnishing  the  unnumbered  graphs  and  the  algorithms  to 
clerks  for  numbering  and  verifying  their  equivalency. 
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Figure  8.  Experimental  Algorithm  for  Canonically  Ordering  Finite  Undirected  Graphs  With  No  Loop 


\ 


Figure  9.  Experimental  Algorithm  for  Canonically  Ordering  Topologically  Equivalent  Nodes  of  Finite 

Undirected  Graphs  With  No  Loops 


Section  VI.  OTHER  CONSIDERATIONS 


Results  obtained  thus  far  on  the  problem  of  graph  isomorphism 
suggest  further  investigation  of  the  following  considerations: 

1.  Given  a  graph  of  n-nodes,  what  is  the  maximum  level  to  which 
one  must  descend  in  the  development  of  LCN's  to  determine  the  exist¬ 
ence  of  a  Class  3  graph? 

2.  In  the  iteration  in  Algorithm  7  to  decide  which  node  is  nearest 
the  node  LAN  -  (k),  is  there  a  limit  on  (k)?  Experience,  thus  far, 
indicates  that  a  selection  is  made  at  either  LAN  -  (1)  or  LAN  -  (2)  or 

1.  lection  is  madt',LAN  -(k)  goes  to  1,.  and  the  selection  is  arbitrary. 

3.  In  the  development  of  the  LCN's,  what  is  the  highest  n-tuplicate 
OIK  may  expect  at  a  given  level  for  a  graph  of  n-nodes? 

4.  In  some  respects,  the  procedure  for  numbering  Class  3  graphs 
is  analogous  to  removing  nodes  as  they  are  n  mbered.  Can  an 
algorithm  be  devised  operating  on  this  principle  of  removing  nodes 
(and  edges)  and  numbering  the  remaining  graph? 

5.  The  relative  "imbeddedness"  of  nodes  may  be  ascertained  by 
tabulating  the  number  of  numbers  appearing  at  a  specified  level.  Is 
this  analogous  to  finding  the  center  (s)  of  a  graph?  This  information 
is  carried  in  tiie  LCN  and  is  derivable  by  the  following  computation: 
given  an  LCN  of  4201  it  yields  an  imbeddedness  value  calculated  as 
follows  (4x1)  +  (2x2)  +  (0)  +  (4x1). 

6.  Can  a  simple  method  be  devised  through  analysis  of  raw 
level  numbers  (or  their  LCN's)  to  determine  if  a  graph  is  planar  or 
nonplanar  ? 
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